A Reason to Add Registers

Authors

  • Steve Bennett
  • David Melski
  • Jim Larus
Abstract

The increasing gap between CPU and memory performance forces us to reevaluate design choices made in the past to optimize the CPU-memory interface. Ideally, all CPU operand traffic would remain on-chip. It is obviously not feasible to have the number of registers be on the order of main memory size, and it is not clear that increasing L1 cache sizes has significant benefit.[10] Current instruction set architectures (ISAs) provide at most 32 registers for the software to explicitly manage. There have been proposals to increase the number of registers, but it is not clear that more registers are useful, as traditional optimizations/transformations and register allocators were not designed to take advantage of a large number of registers. We have investigated the feasibility and utility of providing a larger area of fast memory which the compiler controls; the form of memory we investigated in this project is registers. As an example of a transformation that takes advantage of a larger register set, we studied scalar replacement. Our goals in this project were to demonstrate three ideas: (1) there are realistic ways to increase the number of registers in an ISA without expanding the number of bits required to address the registers, (2) scalar replacement is an effective way of increasing the number of registers used, and (3) compared to the results reported by Callahan, Carr and Kennedy,[11] the performance of the Livermore Loops transformed by scalar replacement improves when a larger number of registers is available. Our experimental results do not allow us to justify all of these hypotheses, however. Our measured speedups on a number of kernels varied widely and did not match up well with those reported by Callahan, Carr and Kennedy. Though we question the performance changes we measured, all of our observations lead us to believe that the expanded register set has concrete performance benefits and is realizable. Based on our experience, we advocate further study of optimizations and transformations which improve performance while increasing register usage.

1.0 Introduction

The memory hierarchy in modern computer systems is made up of a number of distinct blocks; some are hardware controlled and some are software/compiler controlled. Table 1 describes the main elements of the hierarchy. We present this information to make a simple observation: the topmost level of the memory hierarchy is compiler controlled, yet it is the most scarce resource. The cache is normally relied upon to provide performance, yet its organization is normally invisible to the compiler and its effectiveness seems to be declining.[10] Additionally, memory accesses (through a cache or not) complicate compiler analysis and hardware optimizations, and they are generally higher-latency operations than those that use registers, since an address calculation and a memory load/store are necessary to work with these operands (even if they are in the cache). For these reasons, minimizing the number of memory accesses is an important way to enable other optimizations, most importantly code scheduling. Code scheduling may be improved for two reasons. First, eliminating instructions may eliminate dependencies (both false and true), enabling additional code motion. Second, difficult-to-analyze memory accesses are removed from consideration.

Optimizing compilers attempt to allocate variables to the CPU registers. Because registers are few in number, not all variables can be kept in registers for their entire lifetimes. Additionally, aliased and dynamically allocated memory cannot be stored in registers because registers and memory form separate address spaces in most architectures.
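The aliasing problem can be made concrete with a small example. The following C fragment is our own illustration (the function and variable names are hypothetical, not taken from the paper); it shows a value that could profitably live in a register but cannot be kept there safely.

    /* Hypothetical illustration: b[0] is loop-invariant and would ideally be
     * held in a register, but the store to a[i] may alias b[0] (a and b may
     * overlap), so a conservative compiler must reload b[0] from memory on
     * every iteration. */
    void add_first(double *a, double *b, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = a[i] + b[0];
    }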
Though registers cannot replace other components of the memory hierarchy such as the cache, it is reasonable to expect that increasing the number of registers will allow the compiler to allocate more variables to registers for longer periods of time, minimizing memory traffic and improving the latency of operations that use the register-resident variables.

From the architectural standpoint, adding registers introduces problems for circuit and instruction set designers. At the circuit level, a larger register set means longer access times, more chip real estate consumed by storage, and possibly an impact on clock cycle time or pipeline design. In today's 32-bit RISC instructions and existing CISC instruction sets, the number of bits in an instruction available for addressing registers is limited (see [19] and [16] for examples). Hence, using traditional register design methodologies, an arbitrarily large set of registers is not feasible. As architectures move to 64 bits, there will be a call to expand the instruction formats to 64 bits as well. While this would provide an opportunity to simplify instruction encodings and expand the register addressing bits, it also increases the instruction fetch bandwidth requirement; this may not be a desirable trade-off. To make large register sets feasible, we must find novel ways to introduce registers without impacting other aspects of the design.

From the software standpoint, adding registers raises some interesting questions. Standard register allocators were designed to work with relatively small register sets; the standard complement is 32 general-purpose and 32 floating-point registers in most modern processors. It is not clear that current allocators are capable of efficiently dealing with more. We must determine how a compiler or programmer could use a large set of registers.

From a systems standpoint, there are a number of issues that must be addressed. The most difficult is the effect of a large register set on context switch times; any increased cost of saving and restoring extra registers must be outweighed by performance improvements. We envision a larger register file being practical and beneficial in environments which service long-running numerical applications. These environments can amortize the cost of context switches over a very long quantum since they are not intended for interactive use. Interactive users could be forced to use only a subset of the registers available to batch programs.

Table 1: Memory Hierarchy Components

    Component     Compiler Visible?   Size                        Location
    Registers     Yes                 Very small (~32 elements)   On-chip
    Cache         No                  Moderate (~1kB-16kB)        On- or off-chip
    Main Memory   Yes                 Large (8MB-1GB)             Off-chip

While these questions are important, they are not fundamental in deciding whether more registers are an effective addition to the processor architecture. In this paper, we address the single most important issue: can more registers improve performance? In the next section, we review relevant previous work. Then in Section 3.0, we detail the work that we have done for this project. In Section 4.0, we present the results of our experiments before concluding the paper in Section 5.0.

2.0 Previous Work

It is well known that optimization of the memory interface is crucial in achieving high performance.
The register allocators that we are familiar with today [13][14][22] are the result of attempts to optimize the use of the fastest component of the memory hierarchy. With the advent of RISC processors, compilers had 32 or more registers to which variables could be allocated. The allocation schemes in use today concentrate on minimizing register pressure, a measure of the demand on the register set, by analyzing the lifetimes of variables and allocating a subset of variables to registers while they are live. The problem of packing variables into registers is NP-complete; heuristics allocate as many variables as possible to registers while consuming as little compilation time as possible. Other phases of the compiler may concern themselves with register pressure as well, attempting to minimize the number of distinct register instances they create. Examples of this concern for the number of registers available are limiting loop unrolling based on the number of registers required and limiting the array accesses that are replaced when doing scalar replacement.[11]

Some recent studies have considered using registers or other high-speed, compiler-controlled memory resources to temporarily hold values which traditional register allocation schemes do not allocate to registers, including array elements and dynamically allocated variables. Holding these variables in registers has two major advantages: (1) memory traffic is minimized by improving the reuse of register values, and (2) code scheduling is improved by eliminating dependencies and difficult-to-analyze references. Both of these aspects are critical in today's computing environments; in supercomputers and even microprocessors, a non-register operand fetch may cost hundreds of CPU cycles.

Sohi and Hsu [20] discuss allocating array variables temporarily to vector register slots in Cray vector supercomputers. They use the vector registers as a high-speed, compiler- (programmer-) controlled cache. Their results indicate that this optimization is beneficial on the kernels studied. Austin, Vijaykumar and Sohi's Knapsack study [9] showed that the compiler can find large numbers of variables in real programs which can be allocated to compiler-controlled portions of the memory hierarchy, but it is not clear how this translates into improved system performance. The fundamental idea is important: the compiler knows a great deal at compile time regarding the optimal placement of variables but is limited by hardware structures which it cannot control, namely the cache.

Callahan, Carr and Kennedy studied a source-to-source compiler transformation they named scalar replacement [11]. This transformation replaces array accesses with scalar temporary variables it introduces into the program. This increases register pressure, since the target compiler attempts to allocate these new temporary variables to registers. The algorithm they describe uses an estimated register pressure measurement to limit replacement in cases where more registers would be needed than are available. If accesses are successfully moved to register-resident scalar temporaries, memory traffic is reduced, the number of instructions in the inner loop may decrease, and there are additional opportunities to schedule the resulting code. This work is similar in spirit to Sohi and Hsu's, but moves well beyond the hand-optimized assembly code used in their work to include automatic detection of variables to be replaced using data dependence analysis.
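As a concrete illustration, consider a loop in which each iteration re-reads a value that the previous iteration stored. The C sketch below is our own (it is not an excerpt from [11]); we use C because, in our setup, SUIF emits C after the transformation.

    /* Before scalar replacement: a[i-1] is loaded from memory on every
     * iteration, even though it is exactly the value stored one iteration
     * earlier. */
    void recur_before(double *a, double *b, int n)
    {
        int i;
        for (i = 1; i < n; i++)
            a[i] = a[i-1] + b[i];
    }

    /* After scalar replacement: the value flowing between iterations is kept
     * in the scalar t, which the register allocator can place in a register;
     * the load of a[i-1] is eliminated from the loop body. */
    void recur_after(double *a, double *b, int n)
    {
        int i;
        double t;
        if (n < 2)
            return;
        t = a[0];
        for (i = 1; i < n; i++) {
            t = t + b[i];
            a[i] = t;
        }
    }

The transformation trades a memory reference for a register, which is exactly why its profitability depends on the number of registers available.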
Combined with loop unrolling, Callahan, Carr and Kennedy found that this transformation displayed good performance improvements on the kernels tested. Individual Livermore Loops showed as much as a 63% increase in performance, and the 24 kernels improved 12% on average. The authors noted that the number of registers available was sometimes the limiting factor in the replacement algorithm's effectiveness. Loop unrolling can increase static code size drastically; Callahan, Carr and Kennedy indicate that limiting the number of registers consumed by the transformed code seems to control code size well. The initial study did not look at a large group of application packages, instead concentrating on the Livermore Loops. A later study [12] looked at conditional control flow and the performance of a larger group of application programs. Performance improvements for these application programs were less encouraging: the average improvement in performance was 3%.

Some work has indicated that more than 32 registers does not improve performance by a significant amount. For scalar programs, the evidence for this statement is strong. Mahlke, et al. studied a number of scalar programs written in C and compiled with a full-featured optimizing compiler. They found that little performance benefit was realized beyond approximately 16 registers.[18] They state that there may be other optimizations which will increase register usage, but they do not elaborate. We found little work showing that more registers are ineffective for numerical applications, however. One exception is that of Kiyohara, et al.[17] In this work, the authors studied 3 floating-point benchmarks and concluded that there were only small performance wins to be found by increasing the number of registers above 32. We believe that this work was flawed, however, because the increase in registers was not accompanied by additional optimizations and transformations to take advantage of them.

3.0 Project Summary

3.1 Overview

We assert that a key to high performance is to provide the compiler with fast hardware resources which it can explicitly control, and then provide it with the tools and algorithms to analyze the program being compiled so that the hardware resources can be used effectively. Without techniques to effectively utilize a compiler-controlled component of the memory hierarchy, expansion of that component will not improve (and may degrade) performance. To improve performance, an increase in the number of registers must be accompanied by compilation techniques that take advantage of them. There are already a number of optimizations and transformations which tend to increase register usage. The most prominent techniques are:

• Common Subexpression Elimination (CSE): expression computation redundancy is reduced by storing the result of a computation for an extended period of time. These values are stored on the stack or in registers.

• Procedure Inlining: by incorporating procedure body code into the calling procedure, the size of the procedure increases, increasing opportunities for CSE, global code scheduling and register allocation. These optimizations can increase register pressure.

• Loop Unrolling: this optimization, combined with traditional optimizations such as code scheduling and register allocation, can increase register pressure. See the Appendix for an example showing this effect of loop unrolling; a small illustrative sketch also follows this list.

• Scalar Replacement: in this transformation, array variable references are reduced by keeping array elements in registers across iterations of the loop.[11]

• Superblock Scheduling: by increasing the size of the blocks the instruction scheduler works within, this technique increases register lifetimes and hence register pressure. Additionally, the proponents of this technique advocate the use of other register-use-increasing optimizations such as static register renaming, loop unrolling and induction variable expansion.[15]
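The Appendix is not reproduced here; the following C sketch is our own hypothetical example of the effect loop unrolling has on register pressure: unrolling exposes independent operations to the scheduler, but each copy of the loop body needs its own register-resident temporaries.

    /* Hypothetical illustration (not the paper's Appendix example).
     * Original reduction: one accumulator and one temporary suffice. */
    double dot(double *a, double *b, int n)
    {
        int i;
        double s = 0.0;
        for (i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }

    /* Unrolled by four (n assumed to be a multiple of 4): to let the
     * scheduler overlap the four independent multiplies, each needs its own
     * temporary, and the four partial sums s0..s3 stay live across the loop,
     * roughly quadrupling the register demand of the body. */
    double dot_unrolled4(double *a, double *b, int n)
    {
        int i;
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (i = 0; i < n; i += 4) {
            s0 += a[i]     * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }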
Our goal in this project is to show examples of concrete performance improvements gained by coupling an increased number of registers with algorithms effective at increasing register usage. Our project proceeded in four distinct phases. First, we evaluated the feasibility of adding registers to an architecture by studying possible implementation techniques. Second, we modified gcc and an existing superscalar simulator to study the addition of registers. Third, we implemented scalar replacement as a source-to-source transformation using SUIF. Lastly, we evaluated the effectiveness of the scalar replacement transformation on a simulated architecture with more than 32 registers.

3.2 Implementation Methods

Kiyohara, et al. detail a method of adding registers to an existing ISA by embedding mapping instructions into the static code.[17] In their study, they show that these mapping instructions have little negative impact on the dynamic execution of the program, while providing a larger register address space. Space and time limitations prevent us from fully exploring these issues; see their paper for a discussion of the architectural and software system considerations of this approach. We are concerned with the compiler's role in utilizing additional registers; hence we assume that the architectural and system issues can be approached in a reasonable manner.

3.3 Compiler and Simulator Modification

We modified a version of GNU gcc and gas targeted at a simulated architecture called SimpleScalar (developed by Todd Austin at the University of Wisconsin). SimpleScalar is a MIPS-based architecture which supports the Unix system calls. While modifying gcc and gas, we found that assumptions about the configuration of the target architecture exist throughout the back end of the compiler and in the intermediate language. This made supporting more than 32 registers in our MIPS-based gcc and gas a difficult task. One part of gcc that could not be changed to recognize additional registers was the parameter-passing code generation. The complication here is that some library routines are written in assembly language targeted at the standard MIPS architecture (both hardware and software); hence we could not change the calling conventions which define which registers are used to pass parameters to functions. We believe this incomplete recognition of the new registers has little impact on overall performance, though we did not investigate this issue further.

The machine modeled by the simulator used in this project is configured as follows:

• Issue width is set at 4.
• The state maintenance mechanism is an RUU[20] with full operand bypass.
• The register file, no matter how large, is accessible in a single cycle.
• Functional units are fully pipelined, capable of starting a new instruction each cycle.
• A 1024-entry branch target buffer with a 2-bit counter predictor is modeled.
• Dynamic memory disambiguation is performed in hardware.
• A 16kB data cache is explicitly modeled; the instruction cache is infinite.
• No virtual memory translation is performed.

3.4 Scalar Replacement Implementation

We implemented the basic scalar replacement algorithm [11] using the SUIF compiler system.[8] SUIF provides a detailed dependence analysis framework. SUIF is a source-to-source translator, reading either C or FORTRAN code and emitting C code after optimizations are performed. This C code is then compiled with our modified version of gcc for use in the simulator. Using the existing tools for analyzing and manipulating SUIF's internal representation, our code searches for array references in innermost loops and performs basic scalar replacement as described in the Callahan, Carr and Kennedy paper; for details of the algorithm, please see the original paper. We did not perform any replacement if conditional control flow was present in the innermost loop.[12] Handling such loops requires more complex analysis, which time constraints did not allow us to pursue.
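As an illustration of this restriction (a hypothetical example of ours, not taken from [11] or [12]), a loop of the following form is left untouched by our pass.

    /* Hypothetical example of a loop our pass skips: because the store to
     * a[i] happens only when the test succeeds, whether a[i-1] holds a value
     * stored on the previous iteration depends on runtime data. Keeping it
     * in a scalar across iterations would require the control-flow-aware
     * analysis of [12], which we did not implement. */
    void conditional_update(double *a, double *b, int n)
    {
        int i;
        for (i = 1; i < n; i++) {
            if (b[i] > 0.0)
                a[i] = a[i-1] + b[i];
        }
    }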

